A macroevolutionary analysis of European Late Upper Palaeolithic stone tool shape using a Bayesian phylodynamic framework

Phylogenetic models are commonly used in palaeobiology to study the patterns and processes of organismal evolution. In the human sciences, phylogenetic methods have been deployed for reconstructing ancestor–descendant relationships using linguistic and material culture data. Within evolutionary archaeology specifically, phylogenetic analyses based on maximum parsimony and discrete traits dominate, which sets limitations for the downstream role cultural phylogenies, once derived, can play in more elaborate analytical pipelines. Recent methodological advances in Bayesian phylogenetics, however, now allow us to infer evolutionary dynamics using continuous characters. Capitalizing on these developments, we here present an exploratory analysis of cultural macroevolution of projectile point shape evolution in the European Final Palaeolithic and earliest Mesolithic (approx. 15 000–11 000 BP) using a Bayesian phylodynamic approach and the fossilized birth–death process model. This model-based approach leaps far beyond the application of parsimony, in that it not only produces a tree, but also divergence times, and diversification rates while incorporating uncertainties. This allows us to compare rates to the pronounced climatic changes that occurred during our time frame. While common in cultural evolutionary analyses of language, the extension of Bayesian phylodynamic models to archaeology arguably represents a major methodological breakthrough.


Introduction
Phylogenies are a vital tool in palaeobiology for understanding the patterns and processes of organismal evolution, diversification, extinction and adaptation [1,2].The introduction of formal cladistics in the 1960s [3], ever-increasing computing power [4,5] and more recent introductions of powerful Bayesian statistical frameworks [6][7][8][9] have had a profound impact on the quantitative rigour and reproducibility of palaeobiological analyses.In the human sciences, the use of phylogenetic methods to reconstruct and analyse historical ancestor-descendant relationships based on cultural traits on the whole lags somewhat behind in terms of the breadth of application and its analytical sophistication.Notably, linguists have long been aware of the similarities between language trees and organismal phylogenies [10,11].Consequently, and thanks to the systematic compilation of discrete lexical data, phylogenetic analyses [12][13][14]-including those using Bayesian methods-are now commonplace in historical linguistics, and have made singular appearances in other disciplines related to cultural evolution, such as musicology [15].
Archaeology is another human science in which phylogenetic analysis has gained some traction in recent decades.Spurred by the development of, initially, the notion of the extended phenotype and later the emergence of cultural evolutionary theory [16,17], archaeologists have adopted phylogenetic methods to model and understand material culture evolution [18][19][20][21][22][23][24][25][26]-Bayesian phylogenetic methods, however, remain the exception [27][28][29][30][31]. Humanly made objects-artefacts-can be described in many ways and archaeologists have devised countless, often not readily compatible qualitative and quantitative ways of doing so.Chief among these is typology, a pre-computational and essentialist classification practice that partitions variation into presumed average forms [32,33].
Inspired by the debate about unit definition in biology (e.g.[34,35]), modern artefact phylogenetics has attempted to break with this classificatory approach, arguing that partitioning individual artefacts into a collection of traits that together describe them circumvents the unduly reifying tendencies of typological classification (cf.[36]).In doing so, however, continuous characters are commonly discretized into arbitrary bins that also do not fully describe the observed variation (e.g.[33]).Again inspired by palaeobiology, archaeologists have also employed landmark-based geometric morphometrics (GMM) to describe artefact shapes (e.g.[37,38]).While this mitigates the drawbacks of using qualitative or discretized traits, landmark placement on artefacts nonetheless remains difficult [39].Most recently, approaches using whole-outline GMM-which is entirely independent of landmark placement and richly describes artefact shape-have been benchmarked against previous studies [40].While the conceptual basis for using such continuous characters (e.g.principal components) in phylogenetic analyses has been firmly established [41][42][43][44][45], pipelines for actually doing so within a Bayesian framework have hitherto been lacking [46][47][48], stymieing both organismal and artefactual phylogenetics.Fresh developments in Bayesian phylogenetics-specifically the development of the computational environments of BEAST2 [8] and RevBayes [9]-now offer flexible analytical pipelines that facilitate the construction of phylogenies from continuous characters (e.g.[49]).Furthermore, we now have statistically coherent models for the analysis of sampling data through time.
Models that combine diversification and sampling processes have also undergone massive development over the past decade.Within palaeobiology, a significant step forward was the introduction of the fossilized birth-death (FBD) process [50,51].This model explicitly includes the fossil sampling process, which accounts for the possibility that extinct samples can appear in the tree as tips or samples along internal branches (sampled ancestors).The FBD process has already been applied to study language evolution (e.g.[52]), and here, we apply this model for the first time within an archaeological context.The use of phylodynamic models further expands the information we can obtain from phylogenetic trees.While phylogenetics seeks to estimate the evolutionary relationships among samples, phylodynamics seeks to estimate variation in the dynamics that generated our trees [53].For example, skyline birth-death process models allow for variation between discrete time intervals in the diversification and sampling rates [54].Applied within a Bayesian framework, we can account for uncertainty in the tree topology and divergence times, meaning we can characterize the diversification dynamics without perfect knowledge of the underlying phylogeny.Phylodynamics is often applied in the context of epidemiology where, for example, low mutation rates might prevent us from inferring a fully resolved transmission tree, but we can still estimate variation in transmission rates [55].In the context of macroevolution, we can apply the FBD skyline model to obtain estimates of speciation, extinction and fossil sampling rates through time [56].
Our study reports on the very first such application of phylodynamics to archaeological data, specifically to knapped stone projectile points from the European Late Upper Palaeolithic (LUP).The European LUP (ca 15-11.5 ka BP) falls into a time of extremely volatile climate and changing environments.The period starts at the end of the Last Glacial Maximum (GS−2), which is followed by a phase of abrupt warming (GI−1), in which icesheets regressed significantly, new landscapes opened up, and flora and fauna underwent considerable changes.Following one-and-a-half millennia of intermittent warming, this interglacial was cut short by the onset of Greenland Stadial 1 (GS−1) at approximately 12.9 ka BP, with a deterioration in climate and the return to very cold, dry and windy conditions, particularly in northern Europe.With the termination of this stadial at approximately 11.7 ka BP, the Pleistocene ended, and the Holocene began (figure 1; [57]).
Archaeologists have long studied the impact of climatic changes on human communities during this time (e.g.[58][59][60]).In comparison with preceding periods of the European Upper Palaeolithicthe iconic Magdalenian-it is argued that the LUP underwent both simplifications of and losses in technology [61] as well as, eventually, a diversification of regional forms.Such assessments are, however, derived chiefly from studies conducted at the scale of sites or regions.Yet, it is arguably at the macro-scale at which the archaeological record can truly shine (e.g.[62]).In fact, previous studies comparing LUP cultures and artefact forms quantitatively across European regions have already shown that many traditional archaeological cultures are beset by issues of epistemic or empirical inadequacy (e.g.[63][64][65][66]).
Despite the evident benefits of macro-archaeological studies-especially when combined with cultural evolutionary theory-these remain rare [67].The often regionally bespoke and mutually incompatible classifications of archaeological material constitute a major barrier to trans-regional, macro-scale studies.The complex socio-political landscape of past and present Europe further complicates the situation, with the result being a patchwork of archaeological cultures whose analytical utility varies [64,68].A recent initiative has tested the robustness of these existing archaeological cultures empirically [66], and in doing so, collated an extensive dataset covering many key sites of the European LUP [69].Our analysis draws on this dataset, but we restrict our analysis to a highly resolved subset of chronologically constrained artefacts, namely projectile points, from such key sites.Based on the tanged/stemmed design of these projectile points, it is reasonable to assume that they were axially hafted and that they served as the working end of hunting weapons that were ostensibly instrumental in the adaptations of these Terminal Pleistocene communities, and that the specifics of their design also reflect aspects of cultural transmission.In focusing on artefact form, we do not deny the utility of discrete (often qualitative) traits-motivated by and derived from detailed technological analyses (cf.[70,71])-in building artefact phylogenies.Large and representative datasets of such traits tested for intra-and inter-observer bias do not exist at present, however.Moreover, artefact shape remains one of the critical phenomenological traits used to distinguish many, if not most, projectile types, underlining the powerful yet mostly untested role that this trait alone plays in traditional cultural taxonomy.
Building on previous work [49], we here take initial exploratory steps towards the Bayesian phylodynamic inference of stone tool phylogenies.A vital contribution here lies first in the benchmarking of these computationally demanding methods, the different models, the data and their interaction.Second, inferring a Bayesian phylogeny from artefact shapes directly without any a priori cultural associations opens up the possibility for the macro-scale study of cultural evolution, including a direct quantitative assessment of the rates of artefact lineage diversification and extinction, and morphological disparity [72]-especially so across the climatically volatile periods of the Terminal Pleistocene, across the many transitional cultural phases, and across the many topographically and biogeographically variable regions of Europe.Our results offer new data-driven insights into the cultural evolutionary dynamics at the end of the Pleistocene in Europe.Notably, the resulting artefact phylogeny maps onto patterns of human dispersal where isolation by distance and climate pressure result in diversification, whereas periods of climatic amelioration and population growth appear to be characterized by a loss of diversity in artefact forms.Our results also align, albeit not perfectly, with recent palaeogenomic results [73] that suggest relations between southern and central Europe.

Artefacts and outline data
We draw upon a published dataset [69] containing European Late Palaeolithic/earliest Mesolithic lithic technology, toolkits and artefact shapes together with their associated metadata.This dataset had been designed for the study of stone tool technology and artefact shape evolution across Europe for the time frame of approximately 15-11 ka BP.It contains, among other data, the two-dimensional outline shapes of 3512 expert-sourced projectile points stemming from key archaeological sites (figure 2).
Lithic projectile points have long played a major role in the periodization of the Late Palaeolithic/earliest Mesolithic of Europe, and in inferring the 'ethnogeographic variability' [74] of presumed contemporaneous communities.Likewise, they are commonly linked to specific hunting strategies, and changes in projectile point forms are interpreted to index changing adaptations.By the same token and with reference to ethnographic observations that link different projectile point shapes to community identities (e.g.[75][76][77]), lithic point shapes are used by many archaeologists as paradigmatic cultural markers in much the same way that index fossils are used in biostratigraphy.These insights are supplemented with archaeological reconstructions of social learning scenarios that arguably document the transmission of knowledge and know-how from elders to youngsters in this period [78][79][80][81].Through meticulous refitting and technological reverse engineering, these studies suggest that particular stone-working recipes-operational schemata-were passed from generation to generation [82].In this way, artefact shapes may reflect parts of such recipes and can hence be used to infer transmission histories [33,83,84].In contrast to many lithic artefacts, whose evolution has hitherto been addressed using phylogenetics, the European LUP stone points appear to have been expedient components in the total weapon system; they were replaced rather than resharpened upon breakage.
After a thorough quality assessment of those shapes, we retained a subset of 985 complete armatures-overwhelmingly unifacial projectile tips manufactured on blades-from sites for which radiocarbon dates are available and whose dating quality was deemed 'reliable' (i.e.dates derived from stratified, in situ layers, in the [69] dataset).For each unique key site/layer combination with radiocarbon dates available, we randomly drew one single specimen, resulting in 87 artefacts representative of the whole research area (figure 3).This way of subsampling reflects the assumptions of the FBD process, assuming a Poisson sampling process along each lineage or branch.This means that a given lineage can only be sampled once at any given time point.All artefacts from the same unique key site/layer combination can be considered equivalent to abundance data in palaeontology.This means we have multiple individuals per lineage at a given locality/layer, corresponding to a single point or short interval in time.
Using the R [85] package Momocs [86], we extracted the two-dimensional whole outlines from all 87 artefacts and combined them with their metadata.As necessary, artefacts were manually reoriented to a standardized direction, and the outlines were centred and scaled.Elliptic Fourier transformation was applied to describe the artefacts' shape contours in harmonics.The first point of the outlines was set to be homologous.We used 28 harmonics to describe 99.9% of the harmonic power.These data were then transferred via a principal components analysis (PCA) for dimensionality reduction to arrive at new, GS-2 de-correlated variables.This resulted in 87 principal component axes to describe 100% of the outlines' variation (figure 4a). Figure 4b shows the specific shape aspect captured by the first nine PC axes.

Radiocarbon dates
Radiocarbon dates for the sites were collected from the Radiocarbon Palaeolithic Europe Database [87], and other site-relevant literature cited in the original dataset [69].All radiocarbon dates were inspected for reliability both in the context of the original data compilation [69] and again for the present paper.The dates were then calibrated in R using the rcarbon package [88] and the IntCal20 calibration curve [89].For each unique key site/layer combination, we calculated the summed probability distributions (SPDs) using the same package.For models without age uncertainty, the median age of the calibrated SPD was used as the single age to which each key site/layer should be dated.For models with age uncertainty, described below, the minimum and maximum of the one-sigma region of the calibrated SPD was used for the range of uncertainty for each key site/layer's date (figure 5).1).The artefacts have been subsampled from the database published in [69], reoriented, centred and scaled.Each of the artefacts represents one unique key site/layer combination for which radiocarbon dates were available.In this visualization, the scales vary between the sub-plots.

Phylodynamic inference
varies across the tree.The tree (phylodynamic) model describes the process of branching (birth), extinction (death) and lineage sampling that generated the tree.It is the tree model that incorporates age information, which, combined with the clock model, allows us to transform evolutionary distances into time.We can also estimate the phylodynamic parameters associated with the tree model, including variation in rates through time.

Tree model
For our tree model, we chose the FBD process [50,51].We estimated a time-calibrated tree using the Bayesian phylogenetic software BEAST2.6 [8], using the FBD tree model, which combines the diversification (birth, death) and sampling processes.The FBD model allows for the explicit incorporation of fossil samples, i.e. specimens sampled before the present (t = 0).Samples with t > 0 can either appear on terminal branches (tips) or along branches as sampled ancestors, and their placement can be estimated during inference using the sampled ancestors (SA) package [56].The model can be used for phylogenetic tree inference of extinct and extant samples or extinct samples only.The FBD process allows us to use the age information of all artefacts and jointly estimate trees and diversification rates.We used the 'canonical' parametrization, with priors on the birth, death and sampling rate parameters.
For each of these parameters we use an exponential prior, Exp (10), with a mean of 0.1.The birth rate (λ) is the rate at which new lineages are added to the tree.The death rate (μ) is the rate at which lineages are terminated.The fossil sampling rate (ψ) is the rate of sampling-through-time along lineages, i.e. for samples, t > 0. An exponential distribution places most of the prior probability on values close to zero but the diffuse tail of the distribution does not preclude much larger values.This choice was based on the order of magnitude of rates that are typically estimated in the context of macroevolution (e.g.[92][93][94]).We set the extant species sampling probability (ρ) to zero, as all our samples existed before the present (t > 0).We used a uniform prior on the origin time of the process, U(0, ∞).

Character evolution model
Continuous traits have been widely used for phylogenetic comparative analyses (i.e.studying trait evolution along an existing time-calibrated tree) and have more recently also been used for tree inference and divergence dating [47].For BEAST2, the contraband package v. 0.0.1 was recently developed [95], which we deployed to use a multivariate Brownian motion model to infer the evolution of the continuous shape traits under the assumption of a random walk.This model has three parameters that we estimated: the root values for each trait, the variance (or rate) parameter of the Brownian motion process for each trait (σ 2 ), and the among-character covariance.For the prior on the root values, we used a normal distribution N(0, 2), for the variance parameter σ 2 , we used a lognormal distribution LN(1, 0.3), and for the covariance parameter we used a uniform distribution, U(−1, 1), following the examples in [95].

Clock model
For the clock model, we experimented with two separate relaxed clock models: The ULNC and the nCat2 model.The ULNC model is an uncorrelated lognormal clock, which allows each branch rate to be independent of the rates of its ancestors or neighbours.Under this model, the rate of any particular branch is drawn from a lognormal distribution, with the mean and standard deviation of  Each date is derived from the calibrated SPD of all available radiocarbon dates for each site/layer combination.For models with age uncertainty, the one-sigma ranges of the SPDs were taken as the minimum and maximum ages.Otherwise, the median ages of the SPDs were used.
this distribution estimated from the data.We used an exponential prior on the mean rate, Exp (10), with a mean of 0.1 and a gamma prior on the s.d.Gamma(0.54,0.38), which has a mean of 0.2.These choices reflect a prior belief that the overall clock rate and variance in the clock rate will both be relatively low but it does not preclude alternative scenarios.The nCat2 model allows branches to be assigned to different rate categories, in this case two rate categories.The rate category values and assignments are estimated.For the prior on the rate values, we used an exponential distribution, Exp (5), with a mean of 0.2.We used a uniform prior on the rate assignments, meaning each branch has a uniform probability of being assigned to either rate category.In both cases, we assume that the overall clock rate was shared across traits.Due to the independence in clock rates, it is possible for these kinds of clock models to account for relatively accelerated bursts of morphological evolution along a given branch.

Skyline analysis
To estimate diversification (birth, death) and sampling rates in different time intervals, we used the FBD skyline model from the BDSky package [56] in BEAST2.This model allows for piecewise constant rate variation through time, meaning that discrete intervals have independent rates.For the skyline models, we chose interval boundaries at 14 600, 12 900 and 11 700 BP to compare the rates between the period before the end of the Last Glacial Maximum (GS−2), the Late Glacial Interstadial (GI−1), the Younger Dryas (GS−1) and the early Holocene, respectively.We parametrized the model using the birth (λ), death (μ) and sampling rates (ψ), as described above, and calculated the diversification (d) and turnover (t) rates post hoc.The diversification rate is the rate at which the tree grows and can be defined as The turnover rate measures the rate at which lineages are replaced and can be define as Note that, d is within the interval (-∞, ∞) and t is within the interval (0, ∞).We calculated the median of the estimated 95% highest posterior density (HPD) intervals for all rates (λ, μ, ψ, d, t) within each time bin using the coda R package [96].

Specimen ages
As the specimens' starting ages and/or fixed ages we chose the median age of each specimen, derived from their respective calibrated radiocarbon SPDs.To account for age uncertainty associated with specimens, we applied a uniform prior between the oldest and youngest radiometric dates of the one-sigma age range from the SPD of the calibrated radiocarbon dates collected from each artefact's respective layer, as described above.An overview of the different scripts is given in the electronic supplementary material, table S1.All ages were divided by 1000 and the results rounded to three decimals.

Testing different combinations of taxa and traits
To study the influence of the number of taxa and traits on the clock rates, we ran several different combinations of taxa and traits.For this, we created four different sets of taxa.One set contains all available artefacts (87 taxa), and the other three datasets have been subsampled in a stratified way, to contain comparable numbers of taxa per chronological interval (4, 8 and 16 per event), resulting in 16, 32 and 60 taxa, in addition to the dataset with 87 taxa.For each of these sets, we experimented with the number of traits and selected the first 2, 3, 6, 9, 10, 20, 44 (corresponding to 50% of the total amount of PC axes rounded up) and all PC axes (87) as traits.As figure 4a shows, 99% of the dataset's total variation is described by the first nine PCs, underlining the importance of experimenting with variable trait numbers.Beyond the convenience that decreasing the number of traits reduces computing time until convergence, importantly, a recent study [97] has also shown that the inclusion of more principal components can lead to an increase in noise and an attendant reduction in phylogenetic signal.

Markov chain Monte Carlo settings
For each script, we ran two independent Markov chain Monte Carlo (MCMC) chains on a high-performance computing system for as many generations as it took to arrive at prior, posterior and likelihood effective sample size (ESS) values all above 200.An ESS of 200 corresponds to a standard error of the mean of 1.77% of the interval width [98].The ESS value was calculated and convergence assessed using the coda package [96] in R, and Tracer [99].Depending on the number of taxa, traits, moves and model specifications, MCMC analyses had to run-for each independent run-for a range between 348 500 000 and 22 820 000 000 generations before convergence was reached.Using logcombiner [8], we collated the log-files and the tree-files of each pair of converged, independent chains, with a 20% burn-in.All further analyses were conducted using the combined log-and tree-files.
To quantify and compare the influence of the different taxa-trait combinations on the clock rates, we visualized the median clock rate of the ULNC model, as well as the variance in the overall clock rate, on a taxa-trait grid using ggplot2 [100] (electronic supplementary material, figure S1; for the nCat2 model, see electronic supplementary material, figure S2).

Running the models under the prior
For 16, 32 and 87 taxa, we ran the ULNC skyline model with age uncertainty 'under the prior', which means we can obtain estimates of the diversification rates under the FBD model, while excluding the influence of the trait and clock models.By excluding the trait information, we are able to distinguish the signal that comes from the fossil sampling times, in the absence of any information from the traits.

Phylogenetic trees
Using the package TreeAnnotator [8], we calculated a maximum clade credibility (MCC) tree from each combined output tree file keeping the target heights.Based on the MCC trees, we extracted various metrics such as posterior clade probabilities (PP), median branch lengths and rates, and the lengths of the 95% HPD range of the node heights representing age uncertainty.We visualized these metrics across all different taxa-trait combinations for the skyline model with age uncertainty under the ULNC model (figure 6).

Increasing traits based on principal component axes decreases the evolutionary rate and variance estimates
For complex phylodynamic inference, every parameter and prior choice requires careful consideration, especially in the context of novel applications.We therefore tested how analyses using continuous traits behaved in relation to the different parameters and model choices, and to investigate, whether different models affect our ability to test certain hypotheses.Electronic supplementary material figures S1A and S2A, show that for the ULNC and nCat2 models respectively, the median clock rate (the overall rate at which traits evolve over time) is reduced with the addition of more traits.The highest median rates are recovered-with a few exceptions-from the dataset of 32 taxa and two traits.Electronic supplementary material, figure S1B, shows the clock rate variance (i.e. the degree to which rates vary across lineages) estimated under the ULNC relaxed clock model (see electronic supplementary material, figure S2B, for the nCat2 clock model).Here, the variance is highest for the subset containing 60 taxa and two traits.Meanwhile, the variance is highest for the smallest number of traits used (three for the ULNC model).The lowest variance in rates is recovered for the smallest subset of 16 taxa.We conclude that more traits contribute to an overall slower average rate of change.These patterns match what we would expect from figure 4a, where each additional trait (PC) adds less and less variation to the data.
Using the ULNC skyline model with age uncertainty, figure 6 summarizes different metrics derived from the MCC tree model across the different taxa-trait combinations.Figure 6(a(i)-(iii)) shows that the low PP values make it challenging to recover support for any specific topology.Further, the limited increase or even decrease in support with the addition of more taxa is partly attributed to the low PP values, resulting in fewer recovered monophyletic groups across the posterior.The observation of a slight increase with n taxa = 9 may be a result of semi-randomly subsampled taxa, but confirmation and further understanding would require simulations.The sub-plots in figure 6(b(i)-(iii)) visualize the length of the age uncertainty range.With an increase in taxa, the overall mean age uncertainty decreases, as well as the variance (figure 6(b(ii))).The addition of traits has the opposite effect because the clock rates also decrease with the addition of more traits (figure 6(b(iii))).The addition of traits seems to be particularly influential for datasets with few taxa (figure 6(b(i))).Median branch lengths increase with the addition of traits, yet the overall median branch rates decrease with the addition of taxa.Here too, the addition of traits seems to be particularly influential for datasets with few taxa (figure 6(c(i))).The addition of traits reduces both the median branch rates and their variance (figure 6(d(iii))).Overall, median branch rates are higher for datasets with more taxa (figure 6(d(i),(ii))).
Figure 7 shows the MCC tree under the ULNC skyline model with age uncertainty.For illustrative purposes, the dataset containing 16 taxa and nine traits was chosen.Like all other trees inferred for this study, posterior clade probabilities are low.Nonetheless, the advantage of the Bayesian inference of phylogenies-as conducted here-is that despite topological uncertainty we can use the data to analyse rates of change in a skyline analysis.

Skyline analysis and rates of cultural evolution
Figure 8 shows the death, birth, diversification and turnover rates for the four chosen climate time bins of the period before the end of GS−2, GI−1, GS−1 and the onset of the Holocene for the different combinations of taxa and traits from the skyline analysis with age uncertainty under the ULNC model (for the nCat2 model, see electronic supplementary material, figure S3).It shows that, although the Posterior clade probability  magnitude of rates varies, the overall trend is stable across all trait combinations.Note the early decline in the birth rate for all taxa-trait combinations in GI−1, except for the model using 16 taxa and nine traits, where the birth rate first drops in the time bin GS−1.When comparing the death rates between the different subsets of taxa, the median death rates and their variance are highest in GI−1 for the subset of 87 taxa, where they decrease again in the Holocene.These general tendencies can also be found when running the same models under the prior, which allows us to exclude the trait information, and study how much signal is generated from the age information per se (figure 9).This general consistency indicates that the signal for the FBD model parameters is, in fact, coming from the sampling times.Based on visual inspection alone, the results differ the most from the results obtained under prior for the median birth rates of the subset of 16 taxa and nine traits.
The results of the skyline analysis depict a clear signal, more apparent with the addition of taxa, as the higher birth rate for the period before the end of GS−2 could neither be guessed from the dates of our sites (figure 5) nor from the prior, which was set to an exponential distribution with a mean of 0.1.Hence, the results of the skyline analysis are robust to the number of taxa and traits.For readability, the tip labels state the name of the site and the abbreviated archaeological taxonomic unit to which they traditionally are assigned, derived from the original dataset.Sampled ancestors/zero length-edges are not collapsed.The tree was created using FigTree [101] and modified with the Inkscape software package [102].

Discussion
The experimental use of continuous traits for the inference of Bayesian phylogenies under a phylodynamic model (the FBD skyline) presented in this paper is novel in archaeology.Until recently, the phylogenetic study of archaeological artefacts has largely been limited to the inference of trees using either parsimony or maximum likelihood approaches, chiefly relying on discrete and often somewhat arbitrary traits.Not only do we infer a phylogenetic tree using continuous data within a Bayesian framework but we also incorporate information about the age of the artefacts by applying a model that incorporates sampling-through-time, the FBD process.In doing so, we were able to derive macroevolutionary metrics such as birth and death rates through time.The Bayesian framework formalizes such a model-based approach and, importantly, provides a measure of uncertainty for the results.
Given the novel nature of this approach, much effort was put into exploring the impact of the data on the different elements of our model, by adding complexity step-by-step.By benchmarking the different models and the impact of changing numbers of taxa and traits, we found, that-for these data, projectile point shapes from the European LUP in the form of PCs-the addition of many shape aspects (PCs) contributed to an overall decrease in average rates of change and their variances.Our results show that the number of taxa used has an impact on model performance as well.We see the reason behind this in what is visualized in figure 4a, where each additional trait (PC) added captures increasingly less variation in the data.Similar observations were made recently in the context of a study using principal components as phylogenetic characters of single-cell gene expression data [97].In our case, the conclusion to select as few traits as possible for the phylogenetic inference, to arrive at maximal evolutionary rates, and to decrease noise in the data, would, in our view, be premature, however.Choosing the model with the most taxa and the fewest traits, resulting in the highest rates, would unduly limit the sampled design space of the artefacts available, as additional traits describe distinct features of the LUP shape repertoire, such as 'tanged', 'shouldered' or 'backed' artefacts.The low posterior clade probabilities across our inferred phylogenies overall preclude any firm interpretations based on their topologies.Such topological uncertainty supports the argument that existing archaeological taxonomies of the LUP-often predominantly based on artefact shape-are not well-founded, as empirical studies have shown (e.g.[66]).That said, the broad contrast over time in birth and death rates across changing climates may reflect changing mobility strategies and hence population contact under cold and warm regimes, respectively.By the same token, and as already noted by [40], LUP armature outlines may in general be relatively poorly suited for these kinds of analyses due to the overall simplicity of the shapes and the seeming lack of normativity in form imposition at this time.In contrast, the results of a companion study, using the outline shapes of Late Neolithic/Early Bronze Age arrowheads deliberately shaped in the context of prestige display, achieved overall higher posterior clade probabilities [49].Phylogenetic analyses of lithic armatures may achieve greater resolution by combining matrices consisting of qualitative traits related to their manufacture and outline shape data.Although horizontal transmission and information sharing across cultural lineages could complicate the phylogenetic signal, it is not possible to test this hypothesis using our modelling framework.Future research examining this aspect of cultural evolution could explore the use of network-based inference tools (e.g.Neighbor-Net [103]).However, network analysis in combination with the FBD model is not currently possible, and using networks to visualise historical relations with considerable time-depth also harbours a range of conceptual challenges (cf.[104]).
Despite the topological uncertainties of the trees, we can legitimately deploy the inferred results to analyse the data at the macro-scale-a key advantage of Bayesian phylodynamic models.Birth and death rates were inferred through a skyline analysis for the four distinct climatological time bins representing warming and cooling events within the temporal scope of this study, that is the end of GS−2, GI−1, GS−1 and the onset of the Holocene.Birth rates were highest at the end of the Last Glacial Maximum (GS−2) and declined in the following time bins.High innovation rates under climate-induced pressure have been observed elsewhere [105,106], while the rapid expansion of hunter-gatherer communities at this time may also have led to isolation by distance.Death rates, in contrast, were lowest in GS−2 and increased in GI−1 and GS−1, before decreasing again at the beginning of the Holocene.The consistent association of artefacts whose traditional labels assign  GS-2 GI-1 GS-1 Holocene GS-2 GI-1 GS-1 Holocene GS-2 GI-1 GS-1 Holocene Greenland Stadials/Interstadials GS-2 GI-1 GS-1 Holocene GS-2 GI-1 GS-1 Holocene GS-2 GI-1 GS-  them to the so-called Epigravettian-a southeastern phenomenon-and the Azilian and Federmesser cultures-more northwestern phenomena-aligns well with recent palaeogenomic results [73] that would suggest a genetic relation between precisely those units.

Conclusion
This paper is the first effort in inferring Bayesian phylogenies from continuous lithic armature outline shape data under a phylodynamic model.The analysis conducted using the FBD skyline model is a first step into the realm of greatly expanded analytical possibilities for studying stone tool evolution in a macro-archaeological way.The reproducible and modular workflow provided here can now be used and expanded in a myriad of ways.For example, future studies could be dedicated to the evaluation of the most likely models of shape evolution and compare models other than Brownian motion [41,107], such as the Ornstein-Uhlenbeck process [108][109][110] or an Early Burst model [111,112].Furthermore, it could be tested whether the robustness of the phylogenies changes with the inclusion of additional data, such as biogeographic information, or in combination with other phylogenetic sources, such as language-or ancient DNA-based phylogenies.Combining qualitative technological traits and shape data would also doubtlessly improve phylogenetic inference, although such data remain difficult to obtain in adequately large sample sizes and lingering issues of intra-and inter-observer variability remain [113].The choice of artefacts could be evaluated, too, and more evidently standardized artefacts than the ones of the LUP, such as the lithic projectile points of the Neolithic, or even pottery profiles could be taken into service.Once robust artefact phylogenies are in place, the extensive analytical repertoire that phylogenetic comparative methods offer, such as testing the association between artefact shape and their size, and/or with favoured prey species as inferred from associated faunal data, could also be taken into use.Finally, the reconstruction of explicit artefact phylodynamics as explored here offers exciting opportunities for analysing genetic and artefactual data in parallel.Both approaches are underwritten by explicit generative models of change-genetic and cultural evolution respectively-and share common methodological features.It might thereby become possible to conduct explicit gene-culture co-evolutionary analyses in these very remote periods of the human past.

Figure 3 .
Figure 3. Panel visualizing the 87 selected artefacts, chronologically separated according to the Dansgaard-Oeschger Events (see figure1).The artefacts have been subsampled from the database published in[69], reoriented, centred and scaled.Each of the artefacts represents one unique key site/layer combination for which radiocarbon dates were available.In this visualization, the scales vary between the sub-plots.

Figure 4 .
Figure 4. (a) The cumulative percentage of variance explained by each individual PC axis.Here, the first nine PC axes capture 99% of the total shape variation.(b) The shape variation along the PC axes.

Figure 5 .
Figure 5.All selected key site/layer combinations with their associated median ages (dots) and age uncertainties (horizontal lines).Each date is derived from the calibrated SPD of all available radiocarbon dates for each site/layer combination.For models with age uncertainty, the one-sigma ranges of the SPDs were taken as the minimum and maximum ages.Otherwise, the median ages of the SPDs were used.

Figure 6 .
Figure 6.Metrics derived from each MCC tree based on the ULNC skyline model with age uncertainty.(a) The posterior clade probabilities, (b) the lengths of the 95% HPD range of the node heights, (c) the median branch lengths, and (d) the median branch rates.In sub-plots (i), the data are shown for each trait set (x-axis), grouped by the number of taxa (16, 32, 60, 87).In (ii), the plot shows all values separated by taxa, summarizing the values across traits.In (iii), all values are separated by trait sets, summarizing the values across taxa.The black dots represent the median, and the red dots the mean.The HPD intervals are represented as vertical bars.

Figure 7 .
Figure 7. Time-scaled MCC tree of the dataset with 16 taxa and the first nine PC axes as traits using the skyline model with age uncertainty and the ULNC clock model.The edges are coloured according to their median clock rate.The horizontal grey bars show the 95% HPD of the ages.PP values are shown for each clade.For readability, the tip labels state the name of the site and the abbreviated archaeological taxonomic unit to which they traditionally are assigned, derived from the original dataset.Sampled ancestors/zero length-edges are not collapsed.The tree was created using FigTree[101] and modified with the Inkscape software package[102].

Figure 8 .
Figure 8. Results of the skyline analysis with age uncertainty using the ULNC model for 16, 32 and 87 taxa with 2, 9 and 44 traits.Visualized are the 95% HPD intervals and medians for the (a) death, (b) birth, (c) diversification and (d) turnover rates for the four major climatic periods of the Greenland ice-core event stratigraphy[57].The scale of the y-axes differs between each of the four sub-plots.

Figure 9 .
Figure 9. Results from under the prior of the skyline analysis with age uncertainty using the ULNC clock model for 16, 32 and 87 taxa with 2, 9 and 44 traits.Visualized are the 95% HPD intervals and medians for the (a) death, (b) birth, (c) diversification and (d) turnover rates for the four major climatic periods of the Greenland ice-core event stratigraphy [57].See figure 8.